Contributor

@Baoyuantop Baoyuantop commented Nov 24, 2025

Description

If APISIX is configured with Consul service discovery and is restarted while it is continuously receiving traffic, requests frequently fail with 503 errors.

By design, only worker 0 pulls nodes directly from Consul and updates the data. The other workers rely on events:register to receive its broadcasts. If a worker has not yet completed events:register when the service list is broadcast during an APISIX restart, that worker misses the data, and requests routed to it return a 503 error.

This PR avoids the problem by sharing the service list through a shared dict instead of the event module.
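
To make the race concrete, here is a toy model in plain Lua. It is only a sketch: `new_event_bus`, `new_event_worker`, `handle_from_shared` and the `svc-a` service are hypothetical names invented for illustration, and a plain table stands in for the shared dict. It shows a worker that registers after the broadcast returning 503, while a lookup against a shared store succeeds regardless of when the worker started.

```lua
-- Toy model of the startup race. All names are hypothetical; this is not
-- the APISIX or lua-resty-events API.

-- Event bus: a broadcast reaches only handlers registered before post().
local function new_event_bus()
    local handlers = {}
    local bus = {}
    function bus.register(fn) handlers[#handlers + 1] = fn end
    function bus.post(data)
        for _, fn in ipairs(handlers) do fn(data) end
    end
    return bus
end

-- Worker that caches whatever service list it receives through events.
local function new_event_worker()
    local worker = { services = nil }
    function worker.on_update(data) worker.services = data end
    function worker.handle(name)
        local nodes = worker.services and worker.services[name]
        return nodes and 200 or 503
    end
    return worker
end

local service_list = { ["svc-a"] = { { host = "127.0.0.1", port = 8080 } } }

local bus = new_event_bus()
local early, late = new_event_worker(), new_event_worker()
bus.register(early.on_update)
bus.post(service_list)          -- worker 0 broadcasts the service list
bus.register(late.on_update)    -- this worker registers after the broadcast

print(early.handle("svc-a"))    --> 200
print(late.handle("svc-a"))     --> 503

-- Shared store (a plain table standing in for a shared dict): the data
-- stays available after it is published, so start order does not matter.
local shared = { services = service_list }

local function handle_from_shared(name)
    local nodes = shared.services and shared.services[name]
    return nodes and 200 or 503
end

print(handle_from_shared("svc-a"))  --> 200
```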

Which issue(s) this PR fixes:

Fixes #12398

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@Baoyuantop Baoyuantop changed the title from "fix: change consul event to shared dict" to "fix: change consul event module to shared dict" Nov 24, 2025
@Baoyuantop Baoyuantop marked this pull request as ready for review December 4, 2025 07:45
@dosubot dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and bug (Something isn't working) labels Dec 4, 2025
Comment on lines +68 to +107
+local function persist_all_services_to_shm()
+    if not consul_dict then
+        return
+    end
+
+    local data, err = core.json.encode(all_services)
+    if not data then
+        log.error("failed to encode consul services for shared dict: ", err)
+        return
+    end
+
-local function discovery_consul_callback(data, event, source, pid)
-    all_services = data
-    log.notice("update local variable all_services, event is: ", event,
-               "source: ", source, "server pid:", pid,
-               ", all services: ", json_delay_encode(all_services, true))
+    local ok, set_err = consul_dict:set(consul_dict_services_key, data)
+    if not ok then
+        log.error("failed to store consul services in shared dict: ", set_err)
+        return
+    end
 end


+local function sync_all_services_from_shm(force_log)
+    if not consul_dict then
+        return
+    end
+
+    local data = consul_dict:get(consul_dict_services_key)
+    if not data then
+        if force_log then
+            log.info("consul shared dict services empty")
+        end
+        return
+    end
+
+    local decoded, err = core.json.decode(data)
+    if not decoded then
+        log.error("failed to decode consul services from shared dict: ", err)
+        return
+    end
+
+    all_services = decoded
+end
Member

@nic-6443 nic-6443 Dec 9, 2025

Having both functions write to and read from shared memory through the all_services variable seems very dangerous. Depending on execution order, a worker may write back the data it has just read from shared memory instead of the latest data from Consul. I recommend removing the all_services variable and simplifying the design so that only the privileged agent starts a timer to periodically fetch data from Consul, while the other workers only load data from shared memory.
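
A rough sketch of that single-writer layout, in plain Lua so it runs outside OpenResty. Everything here is an assumption made for illustration: `new_mock_dict`, `publish_services`, `lookup_nodes`, `SERVICES_KEY` and the sample data are hypothetical names, and the mock stores Lua tables directly, whereas a real shared dict holds only scalar values and would need the JSON encode/decode step from the diff.

```lua
-- Mock standing in for a shared dict (assumption: real shared dicts store
-- only scalars, so real code would JSON-encode before set and decode after get).
local function new_mock_dict()
    local store = {}
    return {
        set = function(_, key, value)
            store[key] = value
            return true
        end,
        get = function(_, key)
            return store[key]
        end,
    }
end

local SERVICES_KEY = "services"  -- hypothetical key name

-- Writer: meant to be called only by the privileged agent's periodic timer.
-- A failed fetch leaves the previously published data untouched.
local function publish_services(dict, fetch_from_consul)
    local services, err = fetch_from_consul()
    if not services then
        return nil, err
    end
    return dict:set(SERVICES_KEY, services)
end

-- Reader: workers never write and keep no module-level copy, so they
-- cannot overwrite newer data with something they read earlier.
local function lookup_nodes(dict, service_name)
    local services = dict:get(SERVICES_KEY)
    return services and services[service_name]
end

local dict = new_mock_dict()

-- Stand-in for a successful Consul fetch.
publish_services(dict, function()
    return { ["svc-a"] = { { host = "10.0.0.1", port = 8080 } } }
end)

-- Stand-in for a failed fetch: nothing is written.
local ok, err = publish_services(dict, function()
    return nil, "consul unreachable"
end)

local nodes = lookup_nodes(dict, "svc-a")
print(nodes[1].host, nodes[1].port, ok, err)
```

With a single writer, data flows in one direction only (Consul, then shared memory, then workers), so there is no ordering in which stale data read from shared memory can be written back.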

Development

Successfully merging this pull request may close these issues.

bug: 503 errors occur after restarting APISIX when using Consul and APISIX both in Docker